Lecture 9: August 25th, 2023#

Reminders:

  • All EDA outcome quizzes have been posted. Attempt the ones you’re missing, and come to student hours if any issues come up! Anthony and I are here to help.

  • “50 years of data science” token-earning assignment due tonight at midnight. As always, this is optional.

  • I’m almost done writing next week’s homework; it will be uploaded by tonight. It will be due Friday of Week 4 at midnight instead of Wednesday.

Coming up:

  • On Monday, we’ll go through the instructions for the final project.

  • The planning worksheet for the final project will be due during Week 5.

Today:

  • We’ll introduce Machine Learning (ML)

  • We’ll start by coding for linear regression

  • Anthony will go through a worksheet on generating data for regression problems. Definitely go if you are able to!

Introduction to Machine Learning#

Let’s take another fieldtrip…to the iPad!

Performing Linear Regression Using scikit-learn#

import pandas as pd
import altair as alt
import seaborn as sns
  • Import the taxis data from Seaborn.

df = sns.load_dataset("taxis")
df.sample(5)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
4731 2019-03-20 18:13:02 2019-03-20 18:34:50 1 4.09 17.0 0.00 0.00 21.30 yellow credit card Clinton East East Harlem South Manhattan Manhattan
6417 2019-03-10 12:10:45 2019-03-10 12:43:05 2 11.17 35.0 0.00 5.76 41.56 green credit card Hillcrest/Pomonok Flatiron Queens Manhattan
4512 2019-03-02 18:22:31 2019-03-02 18:33:00 1 1.47 8.5 2.36 0.00 14.16 yellow credit card West Village Flatiron Manhattan Manhattan
4416 2019-03-09 04:14:32 2019-03-09 04:47:43 2 11.77 37.0 10.20 0.00 51.00 yellow credit card Lower East Side Washington Heights South Manhattan Manhattan
2137 2019-03-08 13:27:32 2019-03-08 13:46:23 1 1.40 12.0 3.05 0.00 18.35 yellow credit card NaN NaN NaN NaN
  • Drop rows with missing values

df = df.dropna()
  • Using Altair, make a scatter plot with “fare” on the y-axis and with “distance” on the x-axis.

alt.Chart(df).mark_circle().encode(
    x="distance",
    y="fare"
)
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:2520, in Chart.to_dict(self, *args, **kwargs)
   2518     copy.data = core.InlineData(values=[{}])
   2519     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2520 return super().to_dict(*args, **kwargs)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:838, in TopLevelMixin.to_dict(self, *args, **kwargs)
    836 copy = self.copy(deep=False)  # type: ignore[attr-defined]
    837 original_data = getattr(copy, "data", Undefined)
--> 838 copy.data = _prepare_data(original_data, context)
    840 if original_data is not Undefined:
    841     context["data"] = original_data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:100, in _prepare_data(data, context)
     98 # convert dataframes  or objects with __geo_interface__ to dict
     99 elif isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
--> 100     data = _pipe(data, data_transformers.get())
    102 # convert string input to a URLData
    103 elif isinstance(data, str):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/utils/data.py:82, in limit_rows(data, max_rows)
     80     values = data
     81 if max_rows is not None and len(values) > max_rows:
---> 82     raise MaxRowsError(
     83         "The number of rows in your dataset is greater "
     84         f"than the maximum allowed ({max_rows}).\n\n"
     85         "See https://altair-viz.github.io/user_guide/large_datasets.html "
     86         "for information on how to plot large datasets, "
     87         "including how to install third-party data management tools and, "
     88         "in the right circumstance, disable the restriction"
     89     )
     90 return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).

See https://altair-viz.github.io/user_guide/large_datasets.html for information on how to plot large datasets, including how to install third-party data management tools and, in the right circumstance, disable the restriction
alt.Chart(...)

Here we get a MaxRowsError: by default, Altair refuses to plot a dataset with more than 5000 rows.

  • Choose 5000 random rows to avoid the MaxRowsError.

Let’s get a random selection of 5000 rows from df. I’m not going to worry about making the sample reproducible; the point of this part is just to get a feel for what the data looks like.
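As an aside, if we did want a reproducible sample, DataFrame.sample accepts a random_state argument. A quick sketch on a made-up DataFrame (df_demo is invented for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for the taxis data
df_demo = pd.DataFrame({"distance": range(100), "fare": range(100)})

# Passing the same random_state twice selects the same rows both times
s1 = df_demo.sample(10, random_state=10)
s2 = df_demo.sample(10, random_state=10)

print(s1.index.equals(s2.index))  # True: identical rows were chosen
```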

alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare"
)

Looking at the data, it seems to be roughly linear. It’s not perfectly linear, but we should be able to approximate a line pretty well. The only weird thing is that horizontal line…let’s see what’s going on there by adding a tooltip.

James brought up a great point: some of the rides cover a distance of zero miles…and are still charged a fare. Let’s remove these points from our data, because this seems very strange.

alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)
df2 = df.sample(5000,random_state=10)
df2 = df2[df2["distance"] > 0]
alt.Chart(df2).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)

The points on the horizontal line all involve rides going to or from an airport. This looks like some kind of fixed-price arrangement, where you can travel to the airport (or get picked up from the airport) from anywhere within a region for a flat fare.

  • What would you estimate is the slope of the “line of best fit” for this data?

We have the points \((0.02, 2.5)\) and \((5, 16)\):

# The slope
(16 - 2.5) / (5 - 0.02)
2.710843373493976

If I had to approximate the line, I’d say the slope is about 2.71.

There is a routine in scikit-learn that we will see many times! Starting now!

1.) Import
2.) Instantiate (create an instance of an object from an appropriate class)
3.) Fit
4.) Predict
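Here is what the four steps look like on a tiny invented dataset (the numbers are made up so the answer is known in advance):

```python
import numpy as np

# 1.) Import
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # note: X is two-dimensional
y = np.array([1.0, 4.0, 7.0, 10.0])         # y is one-dimensional

# 2.) Instantiate
model = LinearRegression()

# 3.) Fit
model.fit(X, y)

# 4.) Predict
model.predict([[4.0]])  # should be very close to 3*4 + 1 = 13
```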

  • Find this slope using the LinearRegression class from scikit-learn.

#1.) import
from sklearn.linear_model import LinearRegression

Create a LinearRegression object and name it reg (for regression)

#2.) Instantiate
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression

We see reg is a LinearRegression object. This is not from base Python; it belongs to scikit-learn.

Below, let’s try to fit the data. We’re going to get an error, and I can say that you will most likely run into this error many times on your own.

#3.) Fit
reg.fit(df2["distance"],df2["fare"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 #3.) Fit
----> 2 reg.fit(df2["distance"],df2["fare"])

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1144     estimator._validate_params()
   1146 with config_context(
   1147     skip_parameter_validation=(
   1148         prefer_skip_nested_validation or global_skip_validation
   1149     )
   1150 ):
-> 1151     return fit_method(estimator, *args, **kwargs)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/linear_model/_base.py:678, in LinearRegression.fit(self, X, y, sample_weight)
    674 n_jobs_ = self.n_jobs
    676 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 678 X, y = self._validate_data(
    679     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    680 )
    682 has_sw = sample_weight is not None
    683 if has_sw:

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/base.py:621, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    619         y = check_array(y, input_name="y", **check_y_params)
    620     else:
--> 621         X, y = check_X_y(X, y, **check_params)
    622     out = X, y
    624 if not no_val_X and check_params.get("ensure_2d", True):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/utils/validation.py:1147, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1142         estimator_name = _check_estimator_name(estimator)
   1143     raise ValueError(
   1144         f"{estimator_name} requires y to be passed, but the target y is None"
   1145     )
-> 1147 X = check_array(
   1148     X,
   1149     accept_sparse=accept_sparse,
   1150     accept_large_sparse=accept_large_sparse,
   1151     dtype=dtype,
   1152     order=order,
   1153     copy=copy,
   1154     force_all_finite=force_all_finite,
   1155     ensure_2d=ensure_2d,
   1156     allow_nd=allow_nd,
   1157     ensure_min_samples=ensure_min_samples,
   1158     ensure_min_features=ensure_min_features,
   1159     estimator=estimator,
   1160     input_name="X",
   1161 )
   1163 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1165 check_consistent_length(X, y)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/utils/validation.py:940, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    938     # If input is 1D raise error
    939     if array.ndim == 1:
--> 940         raise ValueError(
    941             "Expected 2D array, got 1D array instead:\narray={}.\n"
    942             "Reshape your data either using array.reshape(-1, 1) if "
    943             "your data has a single feature or array.reshape(1, -1) "
    944             "if it contains a single sample.".format(array)
    945         )
    947 if dtype_numeric and hasattr(array.dtype, "kind") and array.dtype.kind in "USV":
    948     raise ValueError(
    949         "dtype='numeric' is not compatible with arrays of bytes/strings."
    950         "Convert your data to numeric values explicitly instead."
    951     )

ValueError: Expected 2D array, got 1D array instead:
array=[2.8  1.2  2.1  ... 2.68 1.6  1.47].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

What goes wrong here is that reg.fit expects a two-dimensional array for the input, but we passed the pandas Series df2["distance"]. We should think of a pandas Series as a one-dimensional object.

df2["distance"].shape
(4972,)

Notice the blank after the comma when we call shape. This is letting us know that the pandas Series is one-dimensional.

Observe the difference with the following:

df2[["distance"]]
distance
2871 2.80
898 1.20
845 2.10
1580 3.35
4002 10.70
... ...
1812 1.20
2191 13.11
4827 2.68
4326 1.60
5779 1.47

4972 rows × 1 columns

df2[["distance"]].shape
(4972, 1)

The object above is treated as a DataFrame with just one column. This is what happens when we index with a list, as in df2[[...]].

One way to remember which input needs two dimensions versus one is the capitalization convention: a capital “X” signals that we need a two-dimensional input, while a lower-case “y” signals that we need a one-dimensional target.
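A quick sketch of the shape difference, plus the reshape(-1, 1) fix that the error message suggests (df_demo is invented for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"distance": [2.8, 1.2, 2.1]})

s = df_demo["distance"]    # Series: one-dimensional
X = df_demo[["distance"]]  # DataFrame: two-dimensional

print(s.shape)  # (3,)
print(X.shape)  # (3, 1)

# The error message suggests reshape(-1, 1), which works on the underlying NumPy array
X_arr = s.to_numpy().reshape(-1, 1)
print(X_arr.shape)  # (3, 1)
```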

reg.fit(df2[["distance"]],df2["fare"])
LinearRegression()

At this point, reg has done all of the hard work of finding a linear equation that approximates our data (“fare” as a linear function of “distance”).

Recall: The original question was asking us to find the slope. Here’s how we can get it:

The slope is stored as the coef_ attribute.

reg.coef_
array([2.72848668])

Notice that this is a NumPy array. If we wanted to extract just the number, we could do this:

reg.coef_[0]
2.7284866819996245

We had estimated before that the slope would be about 2.71, so I think we did a pretty good job :)

  • Find the intercept.

The intercept is stored as the intercept_ attribute.

reg.intercept_
4.660714229453321

Putting these together, the equation of our line is given by:

\[ \text{fare} \approx 2.7284866819996245 \cdot (\text{distance}) + 4.660714229453321 \]

Good question from the chat: Why does reg.intercept_ not give you an array?

Answer: It has to do with the form of the model. In our case we trained on just one input, distance, so the model looks like the equation we wrote above, with one coefficient and a single intercept. We don’t have to consider distance by itself: we could also train on distance, number of passengers, and the hour of the taxi ride. With those three input variables we get three distinct coefficients, returned together in a NumPy array, but still only one intercept.

\[ \text{fare} \approx c_0*(\text{distance}) + c_1*(\text{number of people}) + c_2*(\text{time}) + \text{intercept} \]
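To see this, here is a sketch fitting on invented data with three input columns. The column names and coefficients are made up, but since the fake fare is exactly linear in the three columns, the fit recovers them:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Invented stand-ins for distance, passengers, and hour
df_demo = pd.DataFrame({
    "distance": rng.uniform(0, 10, 50),
    "passengers": rng.integers(1, 5, 50),
    "hour": rng.integers(0, 24, 50),
})
# A made-up fare that really is linear in all three columns
df_demo["fare"] = (2.7 * df_demo["distance"]
                   + 0.5 * df_demo["passengers"]
                   + 0.1 * df_demo["hour"]
                   + 3.0)

reg3 = LinearRegression()
reg3.fit(df_demo[["distance", "passengers", "hour"]], df_demo["fare"])

print(reg3.coef_)       # array of three coefficients, one per input column
print(reg3.intercept_)  # still a single number
```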
  • What are the predicted outputs for the first 5 rows? What are the actual outputs?

df2[:5]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
2871 2019-03-12 20:28:02 2019-03-12 20:43:16 1 2.80 12.0 3.15 0.00 18.95 yellow credit card Upper East Side South East Village Manhattan Manhattan
898 2019-03-24 13:17:38 2019-03-24 13:31:41 1 1.20 10.0 2.65 0.00 15.95 yellow credit card Murray Hill Clinton East Manhattan Manhattan
845 2019-03-04 13:22:23 2019-03-04 13:38:07 1 2.10 11.5 2.96 0.00 17.76 yellow credit card Midtown East Upper West Side South Manhattan Manhattan
1580 2019-03-21 23:31:03 2019-03-21 23:42:56 1 3.35 12.0 3.16 0.00 18.96 yellow credit card Kips Bay Lincoln Square East Manhattan Manhattan
4002 2019-03-16 08:55:35 2019-03-16 09:37:31 3 10.70 39.0 9.10 5.76 54.66 yellow credit card Manhattan Valley LaGuardia Airport Manhattan Queens

Notice that the first row has a distance of 2.8 and a fare of 12. The model will predict the following for a distance of 2.8:

reg.coef_*2.8 + reg.intercept_
array([12.30047694])
reg.predict(df2[:5][["distance"]])
array([12.30047694,  7.93489825, 10.39053626, 13.80114461, 33.85552173])

reg.fit is still a little mysterious, but reg.predict is not: it just evaluates our linear function at the given distances.
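We can check this claim on a small invented example: predict agrees with computing slope * distance + intercept by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented distances and fares, just for illustration
X = np.array([[1.0], [2.5], [4.0], [8.0]])
y = np.array([7.5, 11.0, 16.0, 26.0])

reg_demo = LinearRegression().fit(X, y)

# predict is just "slope * distance + intercept"
by_hand = reg_demo.coef_[0] * X[:, 0] + reg_demo.intercept_
print(np.allclose(reg_demo.predict(X), by_hand))  # True
```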

Interpreting Linear Regression Coefficients#

  • Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.

df2.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')
df2.dtypes
pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
dtype: object
df2["hour"] = df2["pickup"].dt.hour
df2.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough hour
2871 2019-03-12 20:28:02 2019-03-12 20:43:16 1 2.80 12.0 3.15 0.00 18.95 yellow credit card Upper East Side South East Village Manhattan Manhattan 20
898 2019-03-24 13:17:38 2019-03-24 13:31:41 1 1.20 10.0 2.65 0.00 15.95 yellow credit card Murray Hill Clinton East Manhattan Manhattan 13
845 2019-03-04 13:22:23 2019-03-04 13:38:07 1 2.10 11.5 2.96 0.00 17.76 yellow credit card Midtown East Upper West Side South Manhattan Manhattan 13
1580 2019-03-21 23:31:03 2019-03-21 23:42:56 1 3.35 12.0 3.16 0.00 18.96 yellow credit card Kips Bay Lincoln Square East Manhattan Manhattan 23
4002 2019-03-16 08:55:35 2019-03-16 09:37:31 3 10.70 39.0 9.10 5.76 54.66 yellow credit card Manhattan Valley LaGuardia Airport Manhattan Queens 8
  • Remove all rows from the DataFrame where the hour is 16 or earlier. (So we are only using late afternoon and evening taxi rides.)

That’s all we got to today! We’ll pick back up on Monday.

  • Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.

Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.

Hint 2. I expected there to be a minutes attribute (after using the dt accessor) but there wasn’t. Call dir to see some options.
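A sketch of the computation the hints point toward, using two invented rides (dt.total_seconds is the relevant method after the subtraction):

```python
import pandas as pd

# Two invented rides
df_demo = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-12 20:28:02", "2019-03-24 13:17:38"]),
    "dropoff": pd.to_datetime(["2019-03-12 20:43:16", "2019-03-24 13:31:41"]),
})

# Subtracting datetime columns gives a timedelta column; convert to minutes
df_demo["duration"] = (df_demo["dropoff"] - df_demo["pickup"]).dt.total_seconds() / 60

print(df_demo["duration"])
```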

  • Fit a new LinearRegression object, this time using “distance”, “hour”, “passengers” as the input features, and using “duration” as the target value.
